ABSTRACT

Hadoop is currently the large-scale data analysis "hammer" 
of choice, but there exist classes of algorithms that aren't 
"nails", in the sense that they are not particularly amenable 
to the MapReduce programming model. To address this, 
researchers have proposed MapReduce extensions or alter- 
native programming models in which these algorithms can 
be elegantly expressed. This essay espouses a very differ- 
ent position: that MapReduce is "good enough", and that 
instead of trying to invent screwdrivers, we should simply 
get rid of everything that's not a nail. To be more specific, 
much discussion in the literature surrounds the fact that it- 
erative algorithms are a poor fit for MapReduce: the simple 
solution is to find alternative non-iterative algorithms that 
solve the same problem. This essay captures my personal 
experiences as an academic researcher as well as a software 
engineer in a "real-world" production analytics environment.
From this combined perspective I reflect on the current state 
and future of "big data" research. 

Author's note: I wrote this essay specifically to be contro- 
versial. The views expressed herein are more extreme than 
what I believe personally, written primarily for the purposes 
of provoking discussion. If after reading this essay you have 
a strong reaction, then I've accomplished my goal :) 

1. INTRODUCTION 

MapReduce [17] has become a ubiquitous framework for
large-scale data processing. The Hadoop open-source im- 
plementation enjoys widespread adoption in organizations 
ranging from two-person startups to Fortune 500 companies. 
It lies at the core of an emerging stack for data analytics, 
with support from industry heavyweights such as IBM, Mi- 
crosoft, and Oracle. Among the advantages of MapReduce 
are the ability to horizontally scale to petabytes of data on 
thousands of commodity servers, easy-to-understand pro- 
gramming semantics, and a high degree of fault tolerance. 

MapReduce, of course, is not a silver bullet, and there 
has been much work probing its limitations, both from a 
theoretical perspective [30] [5] and empirically by exploring
classes of algorithms that cannot be efficiently implemented
with it [23] [12] [56]. Many of these empirical studies take
the following form: they present a class of algorithms for 
which the naive Hadoop solution performs poorly, expose 
it as a fundamental limitation of the MapReduce programming
model,¹ and propose an extension or alternative that

1 Note that in this paper I attempt to be precise when referring to 



addresses the limitation. The algorithms are expressed in 
this new framework, and, of course, experiments show sub- 
stantial (an order of magnitude!) performance improvements 
over Hadoop. 

This essay espouses a very different position, that Map- 
Reduce is "good enough" (even if the current Hadoop im- 
plementation could be vastly improved). While it is true 
that a large class of algorithms are not amenable to Map- 
Reduce implementations, there exist alternative solutions 
to the same underlying problems that can be easily imple- 
mented in MapReduce. Staying in its confines allows more 
tightly-integrated, robust, end-to-end solutions to heteroge- 
neous large-data challenges. 

To apply a metaphor: Hadoop is currently the large-scale 
data processing hammer of choice. We've discovered that, in 
addition to nails, there are actually screws — and it doesn't 
seem like hammering screws is a good idea. So instead of try- 
ing to invent a screwdriver, let's just get rid of the screws. 
If there are only nails, then our MapReduce hammer will 
work just fine. To be specific, much discussion in the lit- 
erature surrounds the fact that iterative algorithms are not 
amenable to MapReduce: the (simple) solution, I suggest, is 
to avoid iterative algorithms! 

I will attempt to support this somewhat radical thesis by 
exploring three large classes of problems which serve as the 
poster children for MapReduce-bashing: iterative graph algorithms
(e.g., PageRank), gradient descent (e.g., for training logistic
regression classifiers), and expectation maximization (e.g., for
training hidden Markov models, k-means). I
begin with vague and imprecise notions of what "amenable" 
and "good enough" mean, but propose a concrete objective 
with which to evaluate competing solutions later. 

This essay captures my personal experiences as an aca- 
demic researcher as well as a software engineer in a pro- 
duction analytics environment. As an academic, I've been 
fortunate enough to collaborate with many wonderful col- 
leagues and students on "big data" since 2007, primarily us- 
ing Hadoop to scale a variety of text- and graph-processing 
algorithms (e.g., information retrieval, statistical machine 
translation, DNA sequence assembly). I recently returned from
an extended two-year sabbatical at Twitter, "in the trenches"
as a software engineer wrestling with various "big data"
problems and trying to build scalable production solutions.

In earnest, I quip "throw away everything not a nail" 
tongue-in-cheek to make a point. More constructively, I 



MapReduce, the programming model, and Hadoop, the popular
open-source implementation.



suggest a two-pronged approach to the development of "big 
data" systems and frameworks. Taking the metaphor a bit 
further (and at the expense of overextending it): On the one 
hand, we should perfect the hammer we already have by im- 
proving its weight balance, making a better grip, etc. On the 
other hand, we should be developing jackhammers — entirely 
new "game changers" that can do things MapReduce and 
Hadoop fundamentally cannot do. In my opinion, it makes 
less sense to work on solving classes of problems for which 
Hadoop is already "good enough". 

2. ITERATIVE GRAPH ALGORITHMS 

Everyone's favorite example to illustrate the limitations of 
MapReduce is PageRank (or more generally, iterative graph 
algorithms). Let's assume a standard definition of a directed
graph $G = (V, E)$ consisting of vertices $V$ and directed edges
$E$, with $S(v_i) = \{v_j \mid (v_i, v_j) \in E\}$ and
$P(v_i) = \{v_j \mid (v_j, v_i) \in E\}$ consisting of the set of all
successors and predecessors of vertex $v_i$ (outgoing and incoming
edges, respectively). PageRank [48] is defined as the stationary
distribution over vertices by a random walk over the graph. That is,
for each vertex $v_i$ in the graph, PageRank computes the value
$PR(v_i)$ representing the likelihood that a random walk will arrive
at vertex $v_i$. This value is primarily induced from the graph
topology, but the computation also includes a damping factor $d$,
which allows for random jumps to any other vertex in the graph. For
non-trivial graphs, PageRank is generally computed iteratively over
multiple timesteps $t$ using the power method:

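$$
PR(v_i; t) =
\begin{cases}
1/|V| & \text{if } t = 0 \\
\dfrac{1-d}{|V|} + d \sum_{v_j \in P(v_i)} \dfrac{PR(v_j; t-1)}{|S(v_j)|} & \text{if } t > 0
\end{cases}
\tag{1}
$$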

The algorithm iterates until either a user defined maximum 
number of iterations has completed, or the values sufficiently 
converge. One common convergence criterion is: 

$$
\sum_{i} \left| PR(v_i; t) - PR(v_i; t-1) \right| < \epsilon \tag{2}
$$

The standard MapReduce implementation of PageRank is 
well known and is described in many places (see, for exam- 
ple, [37] ). The graph is serialized as adjacency lists for each 
vertex, along with the current PageRank value. Mappers 
process all the vertices in parallel: for each vertex on the 
adjacency list, the mapper emits an intermediate key-value 
pair with the destination vertex as the key and the partial 
PageRank contribution as the value (i.e., each vertex dis- 
tributes its present PageRank value evenly to its successors). 
The shuffle stage performs a large "group by", gathering all 
key-value pairs with the same destination vertex, and each 
reducer sums up the partial PageRank contributions. 

Each iteration of PageRank corresponds to a MapReduce 
job.² Typically, running PageRank to convergence requires
dozens of iterations. This is usually handled by a control 
program that sets up the MapReduce job, waits for it to 
complete, and then checks for convergence by reading in 
the updated PageRank vector and comparing it with the 
previous. This cycle repeats until convergence. Note that 
the basic structure of this algorithm can be applied to a 
large class of "message-passing" graph algorithms [39] [42]
(e.g., breadth-first search follows exactly the same form). 

2 This glosses over the treatment of the random jump factor, 
which is not important for the purposes here, but see [37].



There is one critical detail necessary for the above ap- 
proach to work: the mapper must also emit the adjacency list 
with the vertex id as the key. This passes the graph struc- 
ture to the reduce phase, where it is reunited (i.e., joined) 
with the updated PageRank values. Without this step, there 
would be no way to perform multiple iterations. 
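
The map, shuffle, and reduce steps described above, together with
the driver loop, can be sketched in a few lines of plain Python.
This is an illustration only, not Hadoop code: the function names
and the toy graph are hypothetical, the shuffle is simulated with
an in-memory dictionary, every vertex is assumed to have at least
one outgoing edge, and the random jump is handled in the simplest
possible way in the reducer.

```python
from collections import defaultdict

def pagerank_map(vertex, rank, successors):
    """Mapper: re-emit the graph structure and distribute rank to successors."""
    yield vertex, ("GRAPH", successors)            # keeps the adjacency list alive
    for succ in successors:
        yield succ, ("RANK", rank / len(successors))

def pagerank_reduce(vertex, values, num_vertices, d=0.85):
    """Reducer: rejoin the adjacency list with the summed rank contributions."""
    successors, total = [], 0.0
    for kind, payload in values:
        if kind == "GRAPH":
            successors = payload
        else:
            total += payload
    # A real job would write (vertex, rank, successors) out for the next iteration.
    return vertex, (1.0 - d) / num_vertices + d * total, successors

def run_iteration(graph, ranks, d=0.85):
    """One simulated MapReduce job: map, shuffle (group by key), reduce."""
    shuffled = defaultdict(list)
    for v, succs in graph.items():
        for key, value in pagerank_map(v, ranks[v], succs):
            shuffled[key].append(value)
    new_ranks = {}
    for v, values in shuffled.items():
        _, rank, _ = pagerank_reduce(v, values, len(graph), d)
        new_ranks[v] = rank
    return new_ranks

def pagerank(graph, epsilon=1e-8, max_iterations=100, d=0.85):
    """Driver: chain jobs until the ranks stop changing or the cap is reached."""
    ranks = {v: 1.0 / len(graph) for v in graph}
    for iteration in range(1, max_iterations + 1):
        new_ranks = run_iteration(graph, ranks, d)
        delta = sum(abs(new_ranks[v] - ranks[v]) for v in graph)
        ranks = new_ranks
        if delta < epsilon:
            break
    return ranks, iteration

# Hypothetical toy graph; vertex "d" has no incoming edges.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}
ranks, iterations = pagerank(toy_graph)
```

Note that the "GRAPH" records travel through the shuffle on every
iteration even though the graph never changes.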

There are many shortcomings with this algorithm:

• MapReduce jobs have high startup costs (in Hadoop, can 
be tens of seconds on a large cluster under load). This 
places a lower bound on iteration time. 

• Scale-free graphs, whose edge distributions follow power 
laws, often create stragglers in the reduce phase. The 
highly uneven distribution of incoming edges to vertices 
produces significantly more work for some reduce tasks 
(take, for example, the reducer assigned to sum up the 
incoming PageRank contributions to google.com in the 
webgraph). Note that since these stragglers are caused 
by data skew, speculative execution [17] cannot solve the 
problem. Combiners and other local aggregation tech- 
niques alleviate but do not fully solve this problem. 

• At each iteration, the algorithm must shuffle the graph 
structure (i.e., adjacency lists) from the mappers to the 
reducers. Since in most cases the graph structure is static, 
this represents wasted effort (sorting, network traffic, etc.). 

• The PageRank vector is serialized to HDFS, along with 
the graph structure, at each iteration. This provides ex- 
cellent fault tolerance, but at the cost of performance. 

To cope with these shortcomings, a number of extensions to 
MapReduce or alternative programming models have been 
proposed. Pregel [42] implements the Bulk Synchronous 
Parallel model [52] : computations are "vertex-centric" and 
algorithms proceed in supersteps with synchronization bar- 
riers between each. In the implementation, all state, includ- 
ing the graph structure, is retained in memory (with peri- 
odic checkpointing). HaLoop [12] is an extension of Hadoop 
that provides support for iterative algorithms by scheduling 
tasks across iterations in a manner that exploits data locality 
and by adding various caching mechanisms. In Twister [23] , 
another extension of Hadoop designed for iteration, interme- 
diate data are retained in memory if possible, thus greatly 
reducing iteration overhead. PrIter [56], in contrast, takes
a slightly different approach to speeding up iterative com- 
putation: it prioritizes those computations that are likely to 
lead to convergence. 

All the frameworks discussed above support iterative constructs,
and thus elegantly solve one or more of
the shortcomings of MapReduce discussed above. However, 
they all have one drawback: they're not Hadoop! The real- 
ity is that the Hadoop-based stack (e.g., Pig, Hive, etc.) has 
already gained critical mass as the data processing frame- 
work of choice, and there are non-trivial costs for adopting a 
separate framework just for graph processing or iterative algorithms.
More on this point in Section 5. For now, consider
three additional factors: 

First, without completely abandoning MapReduce, there 
are a few simple "tweaks" that one can adopt to speed up 
iterative graph algorithms. For example, the Schimmy pat- 
tern [55] avoids the need to shuffle the graph structure by 
consistent partitioning and performing a parallel merge join 
between the graph structure and incoming graph messages 
in the reduce phase. The authors also show that great gains 



can be obtained by simple partitioning schemes that increase 
opportunities for partial aggregation. 
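
To illustrate the idea of a merge join in the reduce phase, here is
a simplified sketch based only on the description above; it is not
the authors' implementation. The function name and data layout are
hypothetical. It assumes that the reducer can read its own graph
partition locally, sorted by vertex id, and that the shuffle
delivers rank messages sorted by the same key and addressed only
to vertices in that partition.

```python
def schimmy_reduce(graph_partition, messages, num_vertices, d=0.85):
    """Merge-join a sorted graph partition with sorted rank messages.

    graph_partition: list of (vertex_id, successors), sorted by vertex_id,
                     read locally by the reducer instead of being shuffled.
    messages:        list of (vertex_id, contribution), sorted by vertex_id,
                     as delivered by the shuffle.
    """
    output, i = [], 0
    for vertex, successors in graph_partition:
        total = 0.0
        # Consume every message addressed to the current vertex.
        while i < len(messages) and messages[i][0] == vertex:
            total += messages[i][1]
            i += 1
        rank = (1.0 - d) / num_vertices + d * total
        output.append((vertex, rank, successors))
    return output

# Hypothetical partition holding three vertices; vertex 3 receives no messages.
partition = [(1, [2]), (2, [1]), (3, [1])]
incoming = [(1, 0.2), (1, 0.1), (2, 0.3)]
result = schimmy_reduce(partition, incoming, num_vertices=3)
```

Only the rank contributions cross the network here; the adjacency
lists stay where they are.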

Second, some of the shortcomings of PageRank in Map- 
Reduce are not as severe as the literature would suggest. In 
a real-world context, PageRank (or any iterative graph algorithm)
is almost never computed from scratch, i.e., initialized with a
uniform distribution over all vertices and run until convergence.
Typically, the previously-computed PageRank vector is supplied as a
starting point on an updated
graph. For example, in the webgraph context, the hyperlink 
structure is updated periodically from freshly-crawled pages 
and the task is to compute updated PageRank values. It 
makes little sense to re-initialize the PageRank vector and 
"start over". Initializing the algorithm with the previously- 
computed values significantly reduces the number of iter- 
ations required to converge. Thus, the iteration penalties 
associated with MapReduce become much more tolerable. 
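
The effect of a warm start can be seen on a toy example. The sketch
below is illustrative only: the two four-vertex graphs are
hypothetical, differing by a single added edge, and the iteration is
a plain in-memory power method rather than a chain of MapReduce
jobs. It counts how many iterations are needed when starting from a
uniform vector versus from the vector computed on the older graph.

```python
def power_iterate(graph, start, d=0.85, epsilon=1e-10, max_iterations=1000):
    """Iterate the PageRank update from `start` until the ranks stop changing."""
    n = len(graph)
    ranks = dict(start)
    for iteration in range(1, max_iterations + 1):
        new_ranks = {v: (1.0 - d) / n for v in graph}
        for v, successors in graph.items():
            share = d * ranks[v] / len(successors)
            for succ in successors:
                new_ranks[succ] += share
        delta = sum(abs(new_ranks[v] - ranks[v]) for v in graph)
        ranks = new_ranks
        if delta < epsilon:
            break
    return ranks, iteration

old_graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
new_graph = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # one new edge: 3 -> 0

uniform = {v: 1.0 / len(new_graph) for v in new_graph}
yesterday, _ = power_iterate(old_graph, uniform)      # previously computed vector

cold_ranks, cold_iterations = power_iterate(new_graph, uniform)
warm_ranks, warm_iterations = power_iterate(new_graph, yesterday)
```

Both runs arrive at the same vector; the warm start simply begins
closer to it, so fewer jobs would need to be chained.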

Third, the existence of graph streaming algorithms for 
computing PageRank suggests that there may be non- 
iterative solutions (or at least approximations thereof) to a 
large number of iterative graph algorithms. This, combined 
with a good starting distribution (previous point), suggests 
that we can compute solutions efficiently, even within the 
confines of MapReduce. 

Given these observations, perhaps we might consider Map- 
Reduce to be "good enough" for iterative graph algorithms? 
But what exactly does "good enough" mean? Let's return 
to this point in Section 5.

3. GRADIENT DESCENT 

Gradient descent (and related quasi-Newton) methods for 
machine learning represent a second large class of problems 
that are poorly suited for MapReduce. To explain, let's 
consider a specific type of machine learning problem, supervised
classification. We define $X$ to be an input space and $Y$ to be an
output space. Given a set of training samples
$D = \{(x_i, y_i)\}_{i=1}^{n}$ from the space $X \times Y$, the task is to
induce a function $f : X \to Y$ that best explains the training
data. The notion of "best" is usually captured in terms of
minimizing "loss", via a function $\ell$ that quantifies the
discrepancy between the functional prediction $f(x_i)$ and the
actual output $y_i$, for example, minimizing the quantity:

$$
\frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i) \tag{3}
$$

which is known as the empirical risk. Usually, we consider
a family of functions $\mathcal{F}$ (i.e., the hypothesis space) that is
parameterized by the vector $\theta$, from which we select:

$$
\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i; \theta), y_i) \tag{4}
$$

That is, we learn the parameters of a particular model. In
other words, machine learning is cast as a functional
optimization problem, often solved with gradient descent.

Rewriting Equation (4) as $\arg\min_{\theta} L(\theta)$ simplifies our
notation. The gradient of $L$, denoted $\nabla L$, is defined as follows:

$$
\nabla L(\theta) = \left[ \frac{\partial L(\theta)}{\partial \theta_1},
\frac{\partial L(\theta)}{\partial \theta_2}, \ldots \right] \tag{5}
$$

The gradient defines a vector field pointing to the direction
in which $L$ is increasing the fastest and whose magnitude
indicates the rate of increase. Thus, if we "take a step"
in the direction opposite to the gradient from an arbitrary
point $a$, $b = a - \gamma \nabla L(a)$, then $L(a) \geq L(b)$, provided
that $\gamma$ (known as the step size) is a small value greater than zero.

If we start with an initial guess of $\theta^{(0)}$ and repeat the
above process, we arrive at gradient descent. More formally,
let us consider the sequence $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots$
defined with the following update rule:

$$
\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla L(\theta^{(t)}) \tag{6}
$$

so that

$$
L(\theta^{(0)}) \geq L(\theta^{(1)}) \geq L(\theta^{(2)}) \geq \ldots \tag{7}
$$

where the sequence converges to the desired local minimum.
If the loss function is convex and $\gamma$ is selected carefully
(which can vary per iteration), we are guaranteed to converge
to a global minimum.

Based on the observation that our loss function decomposes
linearly, and therefore the gradient as well, the MapReduce
implementation of gradient descent is fairly straightforward.
We process each training example in parallel and
compute its partial contribution to the gradient, which is
emitted as an intermediate key-value pair and shuffled to a
single reducer. The reducer sums up all partial gradient
contributions and updates the model parameters. Thus, each
iteration of gradient descent corresponds to a MapReduce
job. Two more items are needed to make this work:

• Complete classifier training requires many MapReduce
jobs to be chained in a sequence (hundreds, even thousands,
depending on the complexity of the problem). Just
as in the PageRank case, this is usually handled by a
driver program that sets up a MapReduce job, waits for
it to complete, and then checks for convergence, repeating
as long as necessary.

• Since mappers compute partial gradients with respect to
the training data, they require access to the current model
parameters. Typically, the parameters are loaded in as
"side data" in each mapper (in Hadoop, either directly
from HDFS or from the distributed cache). However, at
the end of each iteration the parameters are updated, so
it is important that the updated model is passed to the
mappers at the next iteration.

Any number of fairly standard optimizations can be applied
to increase the efficiency of this implementation, for
example, combiners to perform partial aggregation or the
in-mapper combining pattern [37]. As an alternative to
performing gradient descent in the reducer, we can substitute a
quasi-Newton method such as L-BFGS [IT] (which is more
expensive, but converges in fewer iterations). However, there
are still a number of drawbacks:


• As with PageRank, Hadoop jobs have high startup costs. 

• Since the reducer must wait for all mappers to finish (i.e., 
all contributions to the gradient to arrive), the speed of 
each iteration is bound by the slowest mapper, and hence 
sensitive to stragglers. This is similar to the PageRank 
case, except in the map phase. 

• The combination of stragglers and using only a single 
reducer potentially causes poor cluster utilization. Of 
course, the cluster could be running other jobs, so from a 
throughput perspective, this is only a minor concern. 
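
To make the structure of the batch approach concrete, the sketch
below simulates it in plain Python for logistic regression. It is an
illustration under stated assumptions, not a Hadoop implementation:
the function names and the four-example dataset are hypothetical,
the first feature of each example is a constant 1.0 acting as a bias
term, the step size is fixed, and each pass of the loop stands in
for one complete MapReduce job with a single reducer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_map(example, theta):
    """Mapper: partial gradient of the logistic loss for one example."""
    x, y = example
    error = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
    return [error * xi for xi in x]

def gradient_reduce(partial_gradients, theta, step_size, n):
    """Single reducer: sum the partial gradients, then update the parameters."""
    total = [sum(column) for column in zip(*partial_gradients)]
    return [t - step_size * g / n for t, g in zip(theta, total)]

def empirical_risk(data, theta):
    """Average logistic loss over the training set."""
    loss = 0.0
    for x, y in data:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss / len(data)

# Hypothetical training set: (features, label), with features[0] as the bias input.
data = [([1.0, 2.0], 1), ([1.0, 1.5], 1), ([1.0, -1.0], 0), ([1.0, -2.5], 0)]
theta = [0.0, 0.0]
losses = [empirical_risk(data, theta)]

for _ in range(50):                          # driver: one "job" per iteration
    partials = [gradient_map(example, theta) for example in data]   # map phase
    theta = gradient_reduce(partials, theta, step_size=0.5, n=len(data))
    losses.append(empirical_risk(data, theta))
```

Every pass reads the entire training set and funnels all partial
gradients through one reducer before the parameters can change,
which is exactly where the drawbacks listed above come from.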



The shortcomings of gradient descent implementations in 
MapReduce have prompted researchers to explore alterna- 
tive architectures and execution models that address these 
issues. All the systems discussed previously in the context of 
PageRank are certainly relevant, but we point out two more 
alternatives. Spark [54] introduces the Resilient Distributed 
Datasets (RDD) abstraction, which provides a restricted form
of shared memory based on coarse-grained transformations 
rather than fine-grained updates to shared state. RDDs can 
either be cached in memory or materialized from durable 
storage when needed (based on lineage, which is the se- 
quence of transformations applied to the data). Classifier 
training is one of the demo applications in Spark. Another 
approach with similar goals is taken by Bu et al., who
translate iterative MapReduce and Pregel-style programs 
into recursive queries in Datalog. By taking this approach, 
database query optimization techniques can be used to iden- 
tify efficient execution plans. These plans are then executed 
on the Hyracks data-parallel processing engine [7]. 

In contrast to these proposed solutions, consider an alter- 
native approach. Since the bottleneck in gradient descent 
is the iteration, let's simply get rid of it! Instead of run- 
ning batch gradient descent to train classifiers, let us adopt 
stochastic gradient descent, which is an online technique. 
The simple idea is that instead of updating the model parameters
only after considering every training example, let us update the
model after each training example (i.e., compute the gradient
with respect to each example).

Online learning techniques have received renewed interest 
in the context of big data since they operate in a streaming
fashion and are very fast [10] [50] [35] [9]. In practice,
classifiers trained using online gradient descent achieve accu- 
racy comparable to classifiers trained using traditional batch 
learning techniques, but are an order of magnitude (or more) 
faster to train [9]. 

Stochastic gradient descent addresses the iteration prob- 
lem, but does not solve the single reducer problem. For 
that, ensemble methods come to the rescue [19] [33]. Instead of
training a single classifier, let us train an ensemble of
classifiers and combine predictions from each (e.g., simple
majority voting, weighted interpolation, etc.). The simplest way
of building ensembles, training each classifier on a partition of
the training examples, is both embarrassingly parallel and
surprisingly effective in practice [43] [44].

Combining online learning with ensembles addresses the 
shortcomings of gradient descent in MapReduce. As a case 
study, this is how Twitter integrates machine learning into 
Pig in a scalable fashion [38]: folding the online learning inside
storage functions and building ensembles by controlling
data partitioning. To reiterate the argument: if MapReduce 
is not amenable to a particular class of algorithms, let's sim- 
ply find a different class of algorithms that will solve the 
same problem and is amenable to MapReduce. 
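
The combination of online learning and ensembles can be sketched as
follows. This is a toy illustration and not the Twitter system
described in [38]: the synthetic two-cluster data, the function
names, the fixed step size, and the choice of three partitions are
all assumptions made for the example. Each partition is consumed in
a single streaming pass, and the per-partition models are combined
by simple majority vote.

```python
import math
import random

def sigmoid(z):
    # Clamp the score so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(partition, step_size=0.1):
    """One streaming pass of stochastic gradient descent (logistic regression)."""
    theta = [0.0, 0.0, 0.0]                   # bias plus two feature weights
    for (x1, x2), y in partition:
        features = (1.0, x1, x2)
        error = sigmoid(sum(t * f for t, f in zip(theta, features))) - y
        # Update immediately after each example: no outer iteration needed.
        theta = [t - step_size * error * f for t, f in zip(theta, features)]
    return theta

def predict(theta, point):
    x1, x2 = point
    return 1 if theta[0] + theta[1] * x1 + theta[2] * x2 > 0 else 0

def ensemble_predict(models, point):
    """Simple majority vote over the per-partition classifiers."""
    votes = sum(predict(theta, point) for theta in models)
    return 1 if 2 * votes > len(models) else 0

rng = random.Random(0)

def sample(label):
    """Hypothetical data: two well-separated Gaussian clusters."""
    center = 2.0 if label == 1 else -2.0
    return (rng.gauss(center, 0.5), rng.gauss(center, 0.5)), label

training = [sample(i % 2) for i in range(300)]
test_set = [sample(i % 2) for i in range(100)]

# Each partition can be trained by an independent task; nothing is
# funneled through a single reducer.
partitions = [training[i::3] for i in range(3)]
models = [train_sgd(p) for p in partitions]
accuracy = sum(ensemble_predict(models, x) == y for x, y in test_set) / len(test_set)
```

Controlling the number of partitions controls the size of the
ensemble, and training requires exactly one pass over the data.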

4. EXPECTATION MAXIMIZATION 

A third class of algorithms not amenable to MapReduce 
is expectation maximization (EM) [18] and EM-like algo- 
rithms. Since EM is related to gradient descent (both are 
first-order optimization techniques) and many of my argu- 
ments are quite similar, the discussion in this section will be 
more superficial. 

EM is an iterative algorithm that finds a successive series of
parameter estimates $\theta^{(0)}, \theta^{(1)}, \ldots$ that improve the
marginal likelihood of the training data, used in cases where
there is incomplete (or unobservable) data. The algorithm
starts with some initial set of parameters $\theta^{(0)}$ and then
updates them using two steps: expectation (E-step), which
computes the posterior distribution over the latent variables
given the observable data and a set of parameters $\theta^{(t)}$,
and maximization (M-step), which computes new parameters
$\theta^{(t+1)}$ maximizing the expected log likelihood of the
joint distribution with respect to the distribution computed
in the E-step. The process then repeats with these new
parameters. The algorithm terminates when the likelihood
remains unchanged.

Similar to iterative graph algorithms and gradient descent, 
each EM iteration is typically implemented as a Hadoop 
job, with a driver to set up the iterations and check for 
convergence. In broad strokes, the E-step is performed in 
the mappers and the M-step is performed in the reducers. 
This setup has all the shortcomings discussed before, and 
EM and EM-like algorithms can be much more elegantly 
implemented in alternative frameworks that better support 
iteration (e.g., those presented above). 

Let's more carefully consider terms that I've been using 
quite vaguely: What does it mean for an algorithm to be 
amenable to MapReduce? What does it mean for Map- 
Reduce to be "good enough"? And the point of comparison? 
Here are two case studies that build up to my point: 

Dyer et al. [32] applied MapReduce to training translation
models for a statistical machine translation system:
specifically, the word-alignment component that uses hidden
Markov models (HMMs) to discover word correspondences across
bilingual corpora [51]. The point of comparison was GIZA++,³ a
widely-adopted in-memory, single-threaded implementation (the
de facto standard used by researchers at the time the work was
performed, and still commonly used today). The authors built a
Hadoop-based implementation of the HMM word-alignment algorithm,
which demonstrated linear scalability compared to GIZA++,
reducing per-iteration training time from hours to minutes.
The implementation exhibited all the limitations associated
with EM algorithms (high job startup costs, awkward passing
of model parameters from one iteration to the next, etc.),
yet compared to the previous single-threaded approach,
MapReduce represented a step forward.⁴ Here is the key point:
whether an algorithm is "amenable" to MapReduce is a rel- 
ative judgment that is only meaningful in the context of an 
alternative. Compared to GIZA++, the Hadoop implemen- 
tation represented an advance. However, this is not incon- 
sistent with the claim that EM algorithms could be more el- 
egantly implemented in an alternate model that better sup- 
ports iteration (e.g., any of the work discussed above). 

3 code.google.com/p/giza-pp/

4 HMM training is relatively expensive computationally, so job
startup costs are less of a concern. Furthermore, these algorithms
typically run for fewer than a dozen iterations.

The second example is the venerable Lloyd's method for
k-means clustering, which can be understood in terms of
EM (not exactly EM, but can be characterized as EM-like).
A MapReduce implementation of k-means shares many of
the limitations discussed thus far. It is true that the
algorithm can be expressed in a simpler way using a programming
model with iterative constructs and executed more efficiently
with better iteration support (and indeed, many of the papers
discussed above use k-means as a demo application). However,
even within the confines of MapReduce, there has been a lot of
work on optimizing clustering algorithms (e.g., [16] [24]). It
is not entirely clear how these
improvements would stack up against using an entirely dif- 
ferent framework. Here, is MapReduce "good enough"? 
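
For reference, the basic (unoptimized) shape of Lloyd's method as a
chain of MapReduce-style jobs can be sketched as below. The code is
a plain Python simulation with hypothetical function names and toy
data: the assignment step plays the role of the mapper, the
centroid update plays the role of the reducer, and the outer loop
is the driver that launches one job per iteration.

```python
from collections import defaultdict

def kmeans_map(point, centroids):
    """Mapper: emit (index of nearest centroid, (point, 1))."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances)), (point, 1)

def kmeans_reduce(values):
    """Reducer: average the points assigned to one cluster."""
    count = sum(n for _, n in values)
    sums = [sum(coords) for coords in zip(*(point for point, _ in values))]
    return tuple(s / count for s in sums)

def lloyd_iteration(points, centroids):
    """One simulated job: map, shuffle (group by cluster), reduce."""
    shuffled = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        shuffled[key].append(value)
    # A centroid that attracts no points keeps its old position.
    return [kmeans_reduce(shuffled[k]) if k in shuffled else centroids[k]
            for k in range(len(centroids))]

# Hypothetical data: two obvious clusters and two initial centroids.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

for _ in range(20):                        # driver: one job per iteration
    updated = lloyd_iteration(points, centroids)
    if updated == centroids:               # centroids have stopped moving
        break
    centroids = updated
```

As with the other examples, the current centroids must be handed to
every mapper as side data at the start of each job.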

These two case studies provide the segue to my attempt 
at more clearly defining what it means for MapReduce to be 
"good enough", and a clear objective for deciding between 
competing solutions. 

5. WHAT'S "GOOD ENOUGH"? 

I propose a pragmatic, operational, engineering-driven cri- 
terion for deciding between alternative solutions to large- 
data problems. First, though, my assumptions: 

• The Hadoop stack, for better or for worse, has already 
become the de facto general-purpose, large-scale data pro- 
cessing platform of choice. As part of the stack I include 
higher-level layers such as Pig and Hive. 

• Complete, end-to-end, large-data solutions involve hetero- 
geneous data sources and must integrate different types 
of processing: relational processing, graph analysis, text 
mining, machine learning, etc. 

• No single programming model or framework can excel at 
every problem; there are always tradeoffs between sim- 
plicity, expressivity, fault tolerance, performance, etc. 

Given these assumptions, the decision criterion I propose 
is this: in the context of an end-to-end solution, would it 
make sense to adopt framework X (HaLoop, Twister, PrIter,
Spark, etc.) over the Hadoop stack for solving the problem at
hand?⁵ Put another way: are the gains obtained from using X
worth the integration costs incurred in building the end-to- 
end solution? If no, then operationally, we can consider the 
Hadoop stack (including Pig, Hive, etc., and by extension, 
MapReduce) to be "good enough". 

Note that this way of thinking takes a broader view of 
end-to-end system design and evaluates alternatives in a 
global context. Considered in isolation, it naturally makes 
sense to choose the best tool for the job, but this neglects 
the fact that there are substantial costs in knitting together 
a patchwork of different frameworks, programming models, 
etc. The alternative is to use a common computing platform 
that's already widely adopted (in this case, Hadoop), even 
if it isn't a perfect fit for some of the problems. 

I propose this decision criterion because it tries to bridge 
the big gap between "solving" a problem (in a research pa- 
per) and deploying the solution in production (which has 
been brought into stark relief for me personally based on my 
experiences at Twitter). For something to "work" in produc- 
tion, the solution must be continuously running; processes 
need to be monitored; someone needs to be alerted when the 
system breaks; etc. Introducing a new programming model, 
framework, etc. significantly complicates this process — even 
mundane things like getting the data imported into the right 
format and results exported to the right location become 
non-trivial if it's part of a long chain of dependencies. 

A natural counter-argument would be: Why should aca- 
demics be concerned with these (mere) "production issues"? 

5 Hadoop is already a proven production system, whereas all the 
alternatives are at best research prototypes; let's even say for the 
sake of argument that X has already been made production ready. 



This ultimately comes down to what one's criteria for suc- 
cess are. For me personally, the greatest reward comes from 
seeing my algorithms and code "in the wild": whether it's 
an end-to-end user-facing service that millions are using on 
a daily basis or an internal improvement in the stack that 
makes engineers and data scientists' lives better. I consider 
myself incredibly lucky to have accomplished both during 
my time at Twitter. I firmly believe that in order for any 
work to have meaningful impact (in the way that I define 
it, recognizing, of course, that others are guided by different
utility functions), how a particular solution fits into the
broader ecosystem is an important consideration.⁶

Different programming models provide different ways of 
thinking about the problem. MapReduce provides "map" 
and "reduce", which can be composed into more complex 
dataflows (e.g., via Pig). Other programming models are 
well-suited to certain types of problems precisely because 
they provide a different way of thinking about the problem. 
For example, Pregel provides a vertex-centered approach 
where "time" is dictated by the steady advance of the super- 
step synchronization barriers. We encounter an impedance 
mismatch when trying to connect different frameworks that 
represent different ways of thinking. The advantages of be- 
ing able to elegantly formulate a solution in a particular 
framework must be weighed against the costs of integrating 
that framework into an end-to-end solution. 

To illustrate, I'll present a hypothetical but concrete ex- 
ample: let's say we wish to run PageRank on the interaction 
graph of a social network (i.e., the graph defined by inter- 
actions between users) . Such a graph is implicit and needs 
to be constructed from behavior logs, which is natural to 
accomplish in a dataflow language such as Pig (in fact, Pig 
was exactly designed for log mining). Let's do exactly that. 

With the interaction graph now materialized, we wish to 
run PageRank. Consider two alternatives: use Giraph,⁷ the
open-source implementation of Pregel, or implement PageRank
directly in Pig.⁸ The advantage of the first is that
the BSP model implemented by Giraph/Pregel is perfect 
for PageRank and other iterative graph algorithms (in fact, 
that's exactly what Pregel was designed to do). The down- 
side is lots of extra "plumbing": munging Pig output into a 
format suitable for Giraph, triggering the Giraph job, wait- 
ing for it to finish, and figuring out what to do with the 
output (if another Pig job depends on the results, then we 
must munge the data back into a form that Pig can use).⁹
In the second alternative, we simply write PageRank in Pig, 
with all the shortcomings of iterative MapReduce algorithms 
discussed in this paper. Each iteration might be slow due 
to stragglers, needless shuffling of graph structure, etc., but 
since we likely have the PageRank vector from yesterday 
to start from, the Pig solution would converge mercifully 
quickly. And with Pig, all of the additional "plumbing" is- 
sues go away. Given these alternatives, I believe the choice of 
the second is at least justifiable (and arguably, preferred),
and hence, in this particular context, I would argue that
MapReduce is good enough.

6 As a side note, unfortunately, the faculty promotion and tenure
process at most institutions does not reward these activities, and
in fact, some would argue actively disincentivizes these activities
since they take time away from writing papers and grants.

7 incubator.apache.org/giraph/

8 techblug.wordpress.com/2011/07/29/pagerank-implementation-in-pig/

9 Not to mention all the error reporting, alerting, error handling
mechanisms that now need to work across Pig and Giraph.
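
The warm-start argument can be illustrated with a minimal sketch. This
is plain single-machine Python rather than Pig or Giraph code, and the
two toy graphs, the damping factor, and the convergence tolerance are
all assumptions made for illustration. It runs power-iteration PageRank
once from a uniform vector and once from the previous day's scores, and
reports how many iterations each run needed.

```python
def pagerank(graph, start=None, damping=0.85, tol=1e-8, max_iter=1000):
    """Power-iteration PageRank. Returns (ranks, iterations_used).

    graph maps each node to a list of its out-neighbors; every node,
    including pure sinks, must appear as a key.
    """
    nodes = sorted(graph)
    n = len(nodes)
    if start is None:
        ranks = {v: 1.0 / n for v in nodes}  # cold start: uniform vector
    else:
        # Warm start: reuse old scores, give unseen nodes a uniform
        # share, then renormalize so the vector sums to one.
        raw = {v: start.get(v, 1.0 / n) for v in nodes}
        total = sum(raw.values())
        ranks = {v: r / total for v, r in raw.items()}
    for iteration in range(1, max_iter + 1):
        # Rank mass sitting on nodes without out-links is spread uniformly.
        dangling = sum(ranks[v] for v in nodes if not graph[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {v: base for v in nodes}
        for v in nodes:
            for w in graph[v]:
                new_ranks[w] += damping * ranks[v] / len(graph[v])
        delta = sum(abs(new_ranks[v] - ranks[v]) for v in nodes)
        ranks = new_ranks
        if delta < tol:
            return ranks, iteration
    return ranks, max_iter


# Hypothetical interaction graphs for two consecutive days.
yesterday = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
today = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": ["a"]}

old_ranks, _ = pagerank(yesterday)
cold_ranks, cold_iters = pagerank(today)
warm_ranks, warm_iters = pagerank(today, start=old_ranks)
print("cold start:", cold_iters, "iterations; warm start:", warm_iters)
```

How much the warm start saves depends on how much the graph changed
overnight. In the limiting case where nothing changed, the loop stops
after a single iteration.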

In my opinion, the arguments are even stronger for the 
case of stochastic gradient descent. Why adopt a separate
machine-learning framework simply to run batch gradient
descent when learning could be seamlessly integrated into
Pig by using stochastic gradient descent and ensemble methods
[38]? This approach costs nothing in accuracy, but gains
tremendously in terms of performance. In the Twitter case 
study, machine learning is accomplished by just another Pig 
script, which plugs seamlessly into existing Pig workflows. 
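
A minimal sketch of the idea follows. It is not the Twitter
implementation or the method of [38]: the synthetic data, the
logistic-regression model without an intercept, the learning rate, and
the number of partitions are all assumptions. Each partition stands in
for the data seen by one task, which trains its own model in a single
sequential pass of stochastic gradient descent; the ensemble then
averages the per-model probabilities.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def train_sgd(examples, learning_rate=0.1):
    """One sequential pass of SGD for logistic regression (no intercept)."""
    weights = [0.0, 0.0]
    for features, label in examples:
        error = label - sigmoid(dot(weights, features))
        weights = [w + learning_rate * error * x
                   for w, x in zip(weights, features)]
    return weights


def ensemble_predict(models, features):
    """Average the positive-class probability over all models."""
    probabilities = [sigmoid(dot(weights, features)) for weights in models]
    return sum(probabilities) / len(probabilities)


# Synthetic, linearly separable data: label is 1 when x0 + x1 > 0.
rng = random.Random(0)
data = []
for _ in range(2000):
    x = [rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)]
    data.append((x, 1 if x[0] + x[1] > 0 else 0))

# Each partition is trained independently, with no communication.
num_partitions = 4
partitions = [data[i::num_partitions] for i in range(num_partitions)]
models = [train_sgd(part) for part in partitions]

print(ensemble_predict(models, [2.0, 2.0]))
print(ensemble_predict(models, [-2.0, -2.0]))
```

Because every model is trained in one pass over its own partition,
training needs no iteration across the cluster, which is what lets it
fit into an ordinary batch job.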

To recap: Of course it makes sense to consider the right 
tool for the job, but we must also recognize the cost asso- 
ciated with switching tools — in software engineering terms, 
the costs of integrating heterogeneous frameworks into an 
end-to-end workflow are non-trivial and should not be ig- 
nored. Fortunately, recent developments in the Hadoop 
project promise to substantially reduce the costs of inte- 
grating heterogeneous frameworks: Hadoop NextGen (aka 
YARN) introduces a generic resource scheduling abstraction 
that allows multiple application frameworks to co-exist on 
the same physical cluster. In this context, MapReduce is just 
one of many possible application frameworks; others include
Spark¹⁰ and MPI.¹¹ This "meta-framework" could potentially
reduce the costs of supporting heterogeneous programming
models, an exciting future development that might let
us have our cake and eat it too. However, until YARN proves
itself in production environments, that potential remains
unrealized.

6. CONSTRUCTIVE SUGGESTIONS 

Building on the arguments above and reflecting on my 
experiences over the past several years working on "big data" 
in both academia and industry, I'd like to make the following 
constructive suggestions: 

Continue plucking low-hanging fruit, or, refine the
hammer we already have. I do not think we have yet suffi- 
ciently pushed the limits of MapReduce in general and the 
Hadoop implementation in particular. In my opinion, it 
may be premature to declare it obsolete and call for a fresh 
ground-up redesign [4, 8]. MapReduce is less than ten years
old, and Hadoop is even younger. There has already been 
plenty of interesting work within the confines of Hadoop, 
just from the database perspective: integration with a traditional
RDBMS P [3], smarter task scheduling [55J 155],
columnar layouts [27] \M [25] [28] [29], embedded indexes [20, 21],
cube materialization [45], and a whole cottage industry
on efficient join algorithms [5, 47, 36, 31]; we've even seen
"traditional" HPC ideas such as work stealing make their way
into the Hadoop context [M]. Much more potential remains 
untapped. 

The data management and distributed systems commu- 
nities have developed and refined a large "bag of tricks" 
over the past several decades. Researchers have tried ap- 
plying many of these in the Hadoop context (see above), 
but there are plenty remaining in the bag waiting to be explored.
Many, if not most, of the complaints about Hadoop
lacking basic features or optimizations found in other data
processing systems can be attributed to immaturity of the
platform, not any fundamental limitations. More than a
"matter of implementation", this work represents worthy research.
Hadoop occupies a very different point in the design
space when compared to parallel databases, so the "standard
tricks" often need to be reconsidered in this new context.

10 github.com/mesos/spark-yarn

11 issues.apache.org/jira/browse/MAPREDUCE-2911

So, in summary, let's fix all the things we have a good idea 
how to fix in Hadoop (low-risk research), and then revisit 
the issue of whether MapReduce is good enough. I believe 
this approach of incrementally refining Hadoop has a greater 
chance of making an impact (at least by my definition of impact
in terms of adoption) than a strategy that abandons Hadoop. 
To invoke another cliche: let's pluck all the low-hanging fruit 
first before climbing to the higher branches. 

Work on game-changers, or, develop the jackhammer. 
To displace (or augment) MapReduce, we should focus on 
capabilities that the framework fundamentally cannot support.
To me, faster iterative algorithms, as illustrated by
PageRank or gradient descent, aren't "it", given my arguments
above that MapReduce is "good enough" for those. I
propose two potential game changers that reflect pain points 
I've encountered during my time in industry: 

First, real-time computation on continuous, large-volume 
streams of data is not something that MapReduce is capable
of. MapReduce is fundamentally a batch processing
framework, and despite efforts to implement "online"
MapReduce [15], I believe solving the general problem requires
something that looks very different from the current
architecture. For example, let's say I want to keep track 
of the top thousand most-clicked URLs posted on Twitter 
in the last n minutes. The current solution is to run batch 
MapReduce jobs with increasing frequency (e.g., every five 
minutes), but there is a fundamental limit to this approach 
(job startup time), and (near) real-time results are not ob- 
tainable (for example, if I wanted up-to-date results over the 
last 30 seconds). 

One sensible approach is to integrate a stream processing
engine, such as a stream-oriented RDBMS (e.g., [13, 26, 32]),
S4 [46], or Storm,¹² with Hadoop, so that the stream processing
engine handles real-time computations, while Hadoop
performs aggregate "roll ups". More work is needed along 
these lines, and indeed researchers are already beginning 
to explore this general direction [14]. I believe the biggest
challenge here is to seamlessly and efficiently handle queries
across vastly different time granularities: from "over the past
30 seconds" (in real time) to "over the last month" (where 
batch computations with some lag would be acceptable). 
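
To make the most-clicked-URLs example concrete, here is a minimal
single-process sketch of the incremental computation that the streaming
side would perform. It does not use the API of S4, Storm, or any other
engine; the class and method names are hypothetical, and it assumes
that click events arrive with non-decreasing timestamps. Counts are
updated as each event arrives and as old events fall out of the window,
so a query never has to rescan the log.

```python
from collections import Counter, deque


class SlidingWindowTopK:
    """Tracks click counts per URL over the last window_seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, url) in arrival order
        self.counts = Counter()  # url -> clicks currently in the window

    def _expire(self, now):
        # Drop events that have aged out of the window ending at `now`.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_url = self.events.popleft()
            self.counts[old_url] -= 1
            if self.counts[old_url] == 0:
                del self.counts[old_url]

    def record_click(self, timestamp, url):
        self.events.append((timestamp, url))
        self.counts[url] += 1
        self._expire(timestamp)

    def top(self, k, now):
        """Return the k most-clicked URLs in the window ending at `now`."""
        self._expire(now)
        return self.counts.most_common(k)


tracker = SlidingWindowTopK(window_seconds=30)
for ts, url in [(0, "a"), (10, "a"), (20, "b"), (25, "b"), (28, "b")]:
    tracker.record_click(ts, url)
print(tracker.top(2, now=29))
```

State here grows with the number of events in the window, which is fine
for "the last 30 seconds" but not for "the last month"; that gap is
where the batch roll-ups in Hadoop would take over.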

Second, and related to the first, real-time interaction with
large datasets is a capability that is sorely needed, but is 
something that MapReduce fundamentally cannot support. 
The rise of "big data" means that the work of data scientists 
is increasingly important; after all, the value of data lies in
the insights that they generate for an organization. Tools
available to data scientists today are primitive: Write a Pig 
script and submit a job. Wait five minutes for the job to 
finish. Discover that the output is empty because of the 
wrong join key. Fix simple bug. Resubmit. Wait another 
five minutes. Rinse, repeat. It's fairly obvious that long 
debug cycles hamper rapid iteration. To the extent that 
we can provide tools to allow rich, interactive, incremental 
interactions with large data sets, we can boost the produc- 
tivity of data scientists, thereby increasing their ability to 
generate insights for the organization. 

Open source everything. Releasing software as open source
should be the default for any work that is done in the
"big data" space. Even the harshest critic would concede 
that open source is a key feature of Hadoop, which facil- 
itates rapid adoption and diffusion of innovation. The vi- 
brant ecosystem of software and companies that exist today 
around Hadoop can be attributed to its open source license. 

Beyond open sourcing, it would be ideal if the results 
of research papers were submitted as patches to existing 
open source software (i.e., associated with JIRA tickets). 
An example is recent work on distributed cube materialization
[45], which has been submitted as a patch to Pig.¹³ Of
course, the costs associated with this can be substantial, but 
this represents a great potential for collaborations between 
academia and industry; committers of open source projects 
(mostly software engineers in industry) can help shepherd 
the patch. In many cases, transitioning an academic research
project to production-ready code makes for a well-defined
summer internship at a company. These are win-win scenarios
for all: the company benefits immediately from new features; 
the community benefits from the open sourcing; and the stu- 
dents gain valuable experience. 

7. CONCLUSION 

The cliche is "if all you have is a hammer, then everything 
looks like a nail". I argue for going one step further: "if all 
you have is a hammer, throw away everything that's not a 
nail"! It'll make your hammer look amazingly useful. At 
least for some time. Sooner or later, however, the flaws of
the hammer will be exposed, but let's try to get as much
hammering done as we can before then. While we're ham- 
mering, though, nothing should prevent us from developing 
jackhammers. 